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Abstract. The local opening of DNA is an intriguing phenomenon from a statistical physics point of 
view, but is also essential for its biological function. For instance, the transcription and replication of our 
genetic code can not take place without the unwinding of the DNA double helix. Although these biological 
processes are driven by proteins, there might well be a relation between these biological openings and the 
spontaneous bubble formation due to thermal fluctuations. Mesoscopic models, like the Peyrard-Bishop- 
Dauxois model, have fairly accurately reproduced some experimental denaturation curves and the sharp 
phase transition in the thermodynamic limit. It is, hence, tempting to see whether these models could be 
used to predict the biological activity of DNA. In a previous study, we introduced a method that allows 
to obtain very accurate results on this subject, which showed that some previous claims in this direction, 
based on molecular dynamics studies, were premature. This could either imply that the present PBD 
should be improved or that biological activity can only be predicted in a more complex frame work that 
involves interactions with proteins and super helical stresses. In this article, we give detailed description 
of the statistical method introduced before. Moreover, for several DNA sequences, we give a thorough 
analysis of the bubble-statistics as function of position and bubble size and the so-called i-denaturation 
curves that can be measured experimentally. These show that some important experimental observations 
are missing in the present model. We discuss how the present model could be improved. 

PACS. 87.15. Aa Theory and modeling; computer simulation - 87.15. He Dynamics and conformational 
changes - 05.10.-a Computational methods in statistical physics and nonlinear dynamics 
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. ^ ■ 1 Introduction sites could be predicted by thermally induced bubbles in 

absence of any proteins |l|2|3j . 

^ j—| The process of DNA denaturation has intrigued both bi- Experimentally, the thermally induced denaturation 
O |. ologists as statistical physicists. Large openings, the so- can be monitored as the breaking of the base-pairs is ac- 
' called DNA bubbles are supposed to allow the formation companied with a large increase of UV absorbance near 
of some specific DNA structures, such as the T-loop that 260 nm. In fact, the UV absorbance measures the reduc- 
stabilizes the end of the chromosomes. The opening of the tion of base-pairing and -stacking when the DNA molecule 
$— i DNA double helix is also a mandatory step for the tran- denaturates. Using this technique, it was found that large 
scription and the replication of the genetic code. In ad- synthetical fabricated homopolymers denaturate suddenly 
dition, the bonds between bases on opposite strands can within a very small temperature interval [4] . This indicates 
break due to thermal fluctuations which can occur even that the process resembles a true first order phase transi- 
at room or physiological temperatures. These thermally tion. On the other hand, natural heterogeneous DNA poly- 
induced DNA bubbles can be several base-pairs long and mers denaturate in multiple steps and the shape of this 
tend to increase at higher temperatures, which eventu- denaturation curve is highly sensitive to the sequence [5]. 
ally results in the complete denaturation or the melting It is known that this process is not only determined by 
of DNA. An intriguing question we could ask ourselves is the fraction of strong (GC) or weak (AT) bonds. The se- 
how the formation of bubbles depend on the base-pair spe- quence specific order is also important. Specific sequences 
cific sequence and how thermally induced bubbles relate can reveal a high opening rate despite a high fraction 
to biophysical DNA unwinding mechanisms that are in- of GC base-pairs [B]. Besides the already mentioned UV 
volved in the transcription and replication. Although these absorbance experiments, many ingenious techniques have 
biological processes are driven by proteins, the intrinsic been devised to study the denaturation process and the 
fluctuations of DNA itself might play an important role, statistical and dynamical properties of DNA bubbles in 
Hence, one could even question whether biological active general. For instance, Raman vibrational spectroscopy [7|8I 
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neutron scattering |9j , fluorescent correlated spectroscopy f 
and Sl-nuclease cleavage [1] have recently put forward as 
promising experimental tools to gain insight in the com- 
plex mechanism of DNA denaturation. 

In general, despite this significant progress, the experi- 
mental techniques reveal only indirect information. Hence, 
complementary computational and theoretical studies are 
often a requisite to complete the interpretation of exper- 
imental data. This is, however, difficult due to the astro- 
nomical large number of atoms that are needed to describe 
solvated DNA. Besides the number of atoms of DNA itself, 
a sufficiently large number of water molecules and counter 
ions should be included. Any full-atom approach is hence- 
forth limited to very short DNA sequences and, for the 
longest sequences that can be studied, meaningful bubble 
statistics cannot be obtained. This has created need for 
mesoscopic theoretical models that allow to study long 
DNA sequences of hundreds or even thousands of base- 
pairs [ll|12|13|14|15|16j . While most of these models try 
to mimic the system by an Ising-like model, the Peyrard- 
Bishop-Dauxois model [15ll6j (PBD) relies on a continu- 
ous approach using an effective force-field as function of 
the base-pair separation. Although more complicated than 
the Ising type models, the PBD model has the advantage 
that it can describe the DNA sequence in a more detailed 
manner than just a simple array of open and closed states 
and it allows to study dynamics as well. An important 
essence of the PBD model is the nonlinear stacking in- 
teraction which reproduces the experimentally measured 
sharp phase transition of long homopolymers [16j . More- 
over, the model, parameterized for heterogeneous DNA 
chains, has given accurate results for denaturation curves 
of short heterogeneous DNA sequences [17] . Although the 
PBD model is a very strong simplification of the actual 
DNA molecule in solution, the qualitative and even quan- 
titative agreement with numerous experimental findings 
have given confidence to this model and to its theoretical 
results for which yet no direct experimental information 
is available. 

It were these findings that inspired Choi et al. [l|2j to 
compare the signal of SI nuclease cleavage experiments 
with the formation of bubbles of a certain size obtained 
from molecular dynamics (MD) simulations of PBD model. 
The detection of bubbles at a certain size requires the 
identification of configurations that contain series of con- 
secutive open base-pairs which is very difficult to accom- 
plish experimentally. Still, as argued in pQ, the SI nuclease 
enzymes can selectively cleave the large temporary open- 
ings while leaving the smaller openings intact, hindered 
by their own physical size. The amount of cleavages at 
certain positions in the DNA chain results in a signal that 
becomes visible after a certain time of incubation (about 
45 min. pQ). The obtained SI nuclease signal showed a re- 
markable correspondence with the calculated probability 
profile for bubbles containing ten or more base-pairs from 
the MD simulations of the PBD model pQ. Moreover, both 
experimental and theoretical graphs showed clear dom- 
inant peaks around the Transcription Start Site (TSS) 
where the biological transcription is initiated. A similar re- 



sult had been reported by Benham et al. [18|19|20|21|2"2"] 
who also found a connection between bubble formation 
and regulatory loci using a theoretical model. However, 
there are two crucial differences between the work of Ben- 
ham et al. and Choi et al. First, the methodology of Ben- 
ham et al. is specified to detect very large openings upto 
100 base-pairs in kilobase sequences, while the work of 
Choi et al. investigates much smaller openings ~ 10 in 
sequences of the order ~ 100 base-pairs. The second and 
most important difference is that work of Benham studies 
the bubbles in vivo which includes torsional effects that 
are generated by other molecules. The apparent evidence 
of Choi p] suggested that spontaneous bubbles in vitro al- 
ready bear the signature of biological activity. A remark- 
able result that was summarized by the statement: DNA 
directs its own transcription pQ. 

Unfortunately, this statement had to be reconsidered 
due to more accurate results by us [3] using a direct in- 
tegration method that is orders-of-magnitude faster than 
MD. An important difficulty with MD or Monte Carlo is 
that large bubbles appear only seldom so that the statisti- 
cal significance can be questioned even for very long sim- 
ulation periods. Our accurate results did not support the 
previously found results at some crucial points. As in [1], 
they indicated that bubbles might appear more easily in 
the biological active sites due to its higher content of AT 
as compared to a random sequence. However, contrary 
to [T], the most dominant peak did not appear at the TSS 
for the sequences under study nor did the promoter se- 
quences have a much higher opening profile as compared 
to biologically inactive sequences. Hence, the statistical 
information on bubbles obtained by the PBD model was 
found to be insufficient to make very accurate predictions 
on transcription start sites or to discriminate between pro- 
moter sequences and biologically inactive sequences as was 
suggested before [112] . This leaves open the following pos- 
sibilities: (i) either the transcription sites cannot be pre- 
dicted by the information on thermally induced bubbles 
alone but require more complex interactions including, for 
instance, superhelical stress, or (ii) the bubble hypothesis 
of Choi et al. still holds, as suggested by the SI nucle- 
ase experiments, but a more accurate theoretical model is 
needed to support these findings. 

The main subject of this article is to give a detailed 
description of the direct integration method introduced 
in [5] and to show some examples of the calculated bub- 
ble statistics for some biologically active and inactive se- 
quences. We will also investigate the validity of the PBD 
model by applying this method to calculate quantities 
that allow a more direct comparison with experiments. 
This article is organized as follows: we will first give a 
short introduction to the PBD model in Sec. [H followed 
by a theoretical discussion on what we will call the dou- 
ble stranded DNA ensemble (dsDNAE) in Sec. H The 
latter is needed to give meaningful results when apply- 
ing the PBD model to finite chains. Then, in Sec. [H we 
give some important definitions concerning the bubble 
statistics of DNA expressed in microscopic terms such 
that it can be calculated by computer experiments. In 
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Sec. Owe introduce the direct integration method includ- 
ing all the technicalities involved. This derivation results 
in an algorithm that implies a repetitive numerical in- 
tegration scheme using a Newton-Cotes rule. The effi- 
ciency of several Newton- Cotes schemes, such as rectan- 
gular, trapezoidal, Simpson's |-rule, Boole's rule, and 11- 
point Newton-Cotes rule, are examined and compared in 
Appendix [A"l In Sec. [5] we show some numerical results of 
the bubble probability profiles of a biologically active pro- 
moter sequence and two artificial Fibonacci sequences. We 
confirm the previous findings: there is no enhanced open- 
ing at transcription start sites or at promoter sequences 
in comparison to biologically inactive sites and sequences 
that have a similar (local) AT content. Then, in Sec. [7] 
we investigate the validity of the PBD model using the di- 
rect integration method to calculate Z-denaturation curves 
which can be measured experimentally by the recently 
introduced quenching technique |23l24l2"5"] . These results 
clearly indicate that some essential ingredients are miss- 
ing in the present PBD model. This implies that the PBD 
model should be improved and that the bubble hypothesis 
of Choi et al. could still hold when an 'ideal theoretical 
model' is considered. In Sec. [H we end with a general dis- 
cussion and make some suggestion that could lead to an 
improved theoretical model. 



the stacking interaction drops from K 1 = K(l + p) down 
to K — K whenever either y k or y k —i becomes signifi- 
cant larger than a -1 . It is thanks to this additional term 
that the observed sharp phase transition in denaturation 
experiments [1] can be reproduced. It is important to note 
the + sign in Eq. ((2]). This makes the stacking potential 
W(yk,Vk-i) n ot a simple function of the relative distance 
\Vk — yk-x\- It was found that, after replacing e ~ a ^ Vk+Vk - 1 ' > 
with e - a \y><-yk-i\ m ^ the denaturation transition 
becomes continuous again as in the original PB model |26j . 
However, Eq. ((2|) is surely not the only possible possi- 
ble potential that can reproduce the sharp transition. Re- 
cently, an alternative potential W(y k , J/fc-i) was suggested 
in |27j which also seems to generate a sharp denaturation 
and only depends \yk — Vk-i\- This shows that reproduc- 
ing experimental curves alone is definitely not enough to 
uniquely determine the effective potentials. Interpretation 
of the physical mechanism that lead to the sharp denat- 
uration transition is a prerequisite for the justification of 
the effective models. The discussion of this mechanism is 
definitely not completely settled, but the argumentation 
that relics in the PBD model seems very plausible, as the 
/9-term mimics the effect of decreasing overlap between tt 
electrons when one of two neighboring base move out of 
stack. 



2 The PBD model 

The PBD model reduces the myriad degrees of freedom of 
DNA to a one-dimensional chain of effective atom com- 
pounds describing the relative base-pair separations yk 
from the ground state positions. The total potential en- 
ergy U for an TV base-pair DNA chain is then given by 

JV 

U(y N ) = VM + Mvu) + W(y k ,y k -i). (1) 

k=2 

Here, y N = {yk} denotes the set of relative base pair po- 
sitions and V k and W are the two PBD-potential energy 
functions given by 

V k (y k ) = D k (e- a *y* - l) 2 (2) 

W(y k ,yk-i) = ^(l + pe-^+y^Yyk-yu^) 2 

The first term V k is the on site Morse potential describing 
the hydrogen bond interaction between bases on opposite 
strands. D k and a k determine the depth and width of the 
Morse potential and are different for the weak AT and 
strong GC base-pair. The stacking potential W consists 
of a harmonic and a nonlinear term. An important reason 
for the success of this model lies in the p-term which was 
introduced in [16) as an improvement upon the original 
Peyrard-Bishop (PB) model [15] . This original PB model 
can be retrieved by taking p — 0. The precise analytical 
shape of W(y k , yk-i) hi Eq. @ is not crucial. What is im- 
portant is that for p > 0, the effective coupling constant of 



After modeling homogeneous DNA, Campa and Gi- 
ansanti generalized the PBD model for the heterogeneous 
case |17|28j . The in total 7 parameters K = 0.025 eV/A 2 , 
p = 2, a = 0.35 A" 1 , D w = 0.05 eV, D s = 0.075 eV, 
a w = 4.2 A -1 , a s = 6.9 A -1 , were derived by fitting to 
experimental denaturation curves of short heterogeneous 
DNA segments. The subscripts w and s refer to the type of 
base-pair at site k in Eq. ([2]). Here, D w and a w are used 
for the weak AT base-pairs and D s and a s are used for 
the strong GC base-pairs. The ratio between D w and D s 
reflects the ratio between the number of hydrogen bonds 
forming the AT and GC base-pair bonding. In fact, the 
reason to fix this ratio is not really justified as the depth 
of the Morse potential does not only reflect the hydrogen 
bond linking (which is in the order of 0.2 eV per hydrogen 
bond) , but also the repulsive interactions of the phosphate 
groups and the effect of the solvent. Still, the absolute and 
relative magnitude of the effective weak and strong inter- 
actions seem to be more or less correct as this parame- 
terization could reproduce the experimental denaturation 
curves of several short DNA sequences as tested in |17I28) . 



Despite these accomplishments, it is also important to 
realize the limitations of the model. The PBD model treats 
the A and T bases and the G and C bases as identical ob- 
jects. The stacking interaction W(y k ,y k -i) is also inde- 
pendent of the nature of the bases at site k and k — 1. Ex- 
perimental measurements 29 30 31 and theoretical calcu- 
lations 32 33 34 35 36 have shown that these are rather 
crude approximations. Future work might aim to improve 
upon this. 
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3 The dsDNA ensemble 



In this section we will assert the need of special ensemble 
that we will call the double stranded DNA ensemble (ds- 
DNAE) and we will give its mathematical definition. The 
reason that we will not use the full NVT or NVE ensemble 
is because the results based on the PBD model have not 
much meaning in these ensembles whenever finite DNA 
chains are considered. The original papers using the PBD 
model were all performed in the thermodynamic limit of 
an infinite DNA chain where this problem does not ap- 
pear. It is in this limit that one can show, using the trans- 
fer integral technique [37] , that the uniform PBD-DNA se- 
quence undergoes a very sharp phase transition [38] upon 
heating, which is first order except in a cross over region 
near the transition temperature that is so narrow that it 
is not accessible to experiments. The difficulty of finite se- 
quences is that PBD model basically represents a single 
DNA chain in an infinite solution. Hence, whenever the 
dsDNA completely separates, the two strands are free to 
go to very large separations without cost of energy due to 
the plateau of the Morse potential. In experiments, where 
the amount of solvated DNA is not infinitely diluted, this 
effect is counterbalanced by the hybridization mechanism 
where two single stranded chains in solution come together 
and match their complementary bases. This implies that, 
per definition, the PBD model cannot reproduce the ex- 
perimental data, which are based on finite concentrations, 
using equilibrium statistics in the full phase space. A con- 
finement of the phase space is always necessary. These 
can be done hiddenly using a series of reasonable short 
MD [16|39|27j or Monte Carlo [40] simulations starting 
from a certain distribution of initial configurations. Here, 
the finite simulation length prohibit the boundless explo- 
ration of the completely separated states. However, this 
strategy will naturally generate results that depend on the 
choice of initial conditions and the simulation length which 
is not completely under control especially at tempera- 
tures near the melting transition [16139] , Alternatively, 
one could restrict configuration space by adding an in- 
finite wall such that yk for all k cannot exceed a certain 
maximum value [41] or by adding a small positive slope to 
the plateau of the Morse potential [38] . These approaches 
still allow for complete denaturation and recombination of 
the two strands, but prevent separations of very large dis- 
tances. This recombination, however, is quite artificial as 
the one-dimensional model does not allow for misfolding, 
the creation of bulge-loops 42J or the recombination with 
a different strand in solution. Therefore, we chose to focus 
to these configurations only that belong to the dsDNAE 
that we will introduce here. In microscopic terms, a con- 
figuration {yk} is called a double stranded DNA (dsDNA) 
molecule when yk < £, for at least one k € [1 : N] with 
£ the opening threshold definition. Similarly, a configura- 
tion is completely denaturated whenever yk > £, for all k. 
All configurations assigned as dsDNA together with their 
corresponding Boltzmann-weight comprise the dsDNAE. 



The statistical average of a certain function A(y N ) 
the full phase space is standardly defined as 



(A) 



Jdy N A(y N )g(y N ) 

fdy N Q(y N ) 



(3) 



with dy N = dywdyN-i ■ ■ ■ dj/i, g = e _/3C/ the probability 
distribution density, and (3 = l/ksT with T the temper- 
ature and ks the Boltzmann constant. In order to define 
the ensemble average in the dsDNAE we introduce follow- 
ing characteristic functions that indicate whether a certain 
base-pair is open or closed. 



h(y k ) = 0(y k -O, k {yk) = e^-y k ) 



(4) 



Here #(•) equals the Heaviside step function. 9 k equals 1 
if the base-pair is open and is zero otherwise. 9k is the 
reverse. Now, the ensemble average oiA{y N ) in dsDNAE 
can be expressed as a weighted average using the weight 
function /j,(y N ): 



(A(y N )) 



with 



{A{y N )») 
(A*) 



N 

n« 

k =i 



(5) 



(6) 



To shorten the notation we have dropped the y k depen- 
dencies. In Eq. ([6]), fi = 1 except when all bases are open; 
then (j, = 0. The dsDNAE removes all difficulties con- 
cerning the unnormalizability of the full phase space equi- 
librium distribution. Besides the opening threshold defini- 
tion £, it does not add any new (hidden) parameters to the 
PBD model as in previous examples. At temperatures suf- 
ficiently below the denaturation transition, the dsDNAE 
gives a good representation of the actual experimental sit- 
uation where only a fraction of the DNA is in the single 
stranded state. It is reasonably simple to use MD in the 
dsDNAE using a biasing-potential, e.g. [3] 



^ bIaS (2/min) = 



_ / (Vrnm ~ £) 6 ^ 2/min > £ 



otherwise 
with y min = Mm[{y k }} 



(7) 



This bias yields an additional force to the system that is 
always zero except when the dsDNA is at the point of 
complete denaturation. Then it gives a strong repulsion 
to the last closed base to prevent the complete opening of 
the whole molecule. Although, MD is certainly much less 
efficient than the direct integration method expressed in 
Sec. MD using the biasing force ([7]) can still be useful for 
calculating properties that do not allow the factorization 
necessary for the integration method or dynamical prop- 
erties. At higher temperatures, the contributions of sin- 
gle stranded DNA, to e.g. UV absorbance, can no longer 
be neglected. Luckily, recent experimental techniques al- 
low to selectively subtract the contributions of the single 
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stranded molecules to the signal [23|24|25] such that, ef- 
fectively, the dsDNA signal can be obtained. Hence, also 
at higher temperatures, the theoretical PBD calculations 
using dsDNAE can be compared with experimental re- 
sults. 

It is an interesting mathematical problem why the 
complete separation does not disturb the thermodynamic 
case. In fact, this can be understood invoking one-dimensional 



any bias had much more difficulty to determine the denat- 
uration temperature due to huge variations in the melting 
region despite the use of very long sequences upto 16384 
base-pairs [39] . 



random walk theory. This reveals that, for a fixed configu- 
ration of the infinite DNA chain, one should always meet 
a closed base-pair when making a walk in one direction 
along the chain Q. Hence, fi is always 1 for the infinite 
case and, thus, the infinite chain remains in the dsDNAE 
at all times. It is important to note that, therefore, the 
constraint to keep always one base-pair closed, does not 
destroy the phase transition. On contrary, the additional 
constraint allows to study thermodynamic limit using fi- 
nite approximants in a much more controlled way. Fig. [T] 
shows the denaturation curves of finite homopolymers of 
increasing length. The results was obtained by the direct 



4 denaturation curves and bubble probability 
matrices 

Using the definitions of 23 24 25 we can call / the fraction 
of open base-pairs and p the fraction of open molecules. 



With the use of Eqs. 
ematical expressions 



([H we can give the following math- 



/ = 



N 

E 

k=l 
N 



n 



(8) 







/ GC50/-< 


AT 100 ft 1 


/ GC100 — ft I 




AT200 / 


/ GC200-" 1 / 




AT400 - — / / 


/ GC400 — // / 


/ ATJ0___----' 









Temperature (K) 

Fig. 1. I (see Eq. ©) as function of temperature for homoge- 
neous AT and GC chains of different lengths. One can clearly 
see that, when the length increases, the cur ves resemble more 
and more a sharp step function. For all sequences free bound- 
ary conditions were applied. 

integration method of Sec. [5] but could as well been ob- 
tained using MD with the bias potential ([7|). The denat- 
uration curves of the 400 GC and 400 AT base-pair se- 
quences resemble already closely the discontinuous step 
function, that would result from an infinite chain, and 
allow to estimate the denaturation temperatures quite ac- 
curately. In contrast, previous analysis using MD without 



1 This is a result from Polya [52] who proved that one- and 
two-dimensional random walks always return to their origin. In 
fact, any site will be visited after an infinite number of steps. 
The probability to return after an infinite number of steps to 
the origin is associated with the Polya's number. This number 
is 1 for one- and two-dimensional systems, but less than one 
for higher dimensions. 



Moreover, we introduce I 123124125 ] as the fraction of open 
base-pairs provided that the molecules is in the double 
stranded state 

1 N 

k=l 

Eq. ([8l9j) obey the relation I = (f-p)/(l-p). The quantity 
I is sometimes called the average fractional bubble length. 
However, this is not completely true as more than one bub- 
ble may occur simultaneously in the same sequence. For 
the infinite case, we have / = / and p — as explained 
in Sec. [3] However, we cannot reproduce the experimental 
f(T) and p(T) curves for finite chains as we have, strictly 
speaking, f(T) — p(T) — 1 at all temperatures in the 
PBD model. Therefore, we will focus on the behavior of 
l(T) which can be measured by the quenching technique of 
Zocchi and co-workers [23 24 25j . Indirectly, /(T) could be 
obtained from 1{T) using the phenomenological approach 
of Campa and Giansanti [17]. This approach, however, re- 
quires two additional parameters that have to be fitted to 
experiments. Therefore, we believe that the calculation of 
1{T) gives the most direct comparison with experimental 
data. 

Of course, the experimental UV absorbance signal can- 
not be literally related to the fraction of open base-pairs as 
it is not a binary type measurement that detects whether 
the base is open or closed. Moreover, the theoretical def- 
inition of 'open' and 'close' is a bit ambiguous as it de- 
pends on the choice of opening threshold £. Still, it is 
known that the UV absorbance changes quite abruptly 
when bases move out of stack, which validates the 6*-like 
expressions © and Moreover, it was found that, at 
least, the qualitative aspects of the theoretical denatura- 
tion curve are not too sensitive to £ when chosen within 
a reasonable interval (~ [1 A: 2 A] ) . Nevertheless, the 
theoretical definitions ([8]) and ([9]) are, not the only ones 
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proposed in literature. In Refs. [40|43j another functional 
form of Eq. §S§ was used 



1 N 



(10) 



*:=i 



We believe, however, that Eq. (|T0|) should be considered 
as imprecise as the UV signal is almost a binary indication 
of the stacking state of a base pair and, hence, cannot be 
related to the mean position (y k ) of the bases. 

Besides denaturation curves, the statistical method in- 
troduced in Ref. [3] allows to study bubbles of a given size. 
The importance to study bubbles of a given size was sug- 
gested by Choi et al. |H2j as its signal could be related to 
SI nuclease cleavage experiments and possibly could tell 
more about its biological function than the mean (y/.) or 
the probability of opening (9(y k — £))■ Before giving the 
definition of, what we call, the bubble probability matrix, 
we will need to introduce the following auxiliary function: 

9 l k " l] = fc _™0 fe+ m +1 Jl fe , for m even 
k'=k—^+l 

~ " 6 k , m+i Yl 9 k , for m odd (11) 



k'=k- 



which is 1 (0 otherwise) if and only if k is at the center 
of a bubble that has exactly size m. For even numbers 
it is a bit arbitrary where to place the center, but we 
defined it as the base directly to the left of the midpoint 
of the bubble. The bubble probability matrix Pbub(fc,m) 
is, now, defined as the probability to have a bubble of size 
m centered at base-pair k provided that the molecule is 
part of the dsDNAE. Hence, 



Pbub(k,m) 



(12) 



In principle, Pb u b(k,m) contains all the information on 
the bubble statistics in a DNA sequence. Still, it is use- 
ful to calculate other quantities as well. From physical and 
biological perspective, it might be useful to know the abil- 
ity to participate in bubbles. Therefore, we introduce the 
Ppnrt{k, m) probability which is the probability to partic- 
ipate in a bubble of at least m sites. 



Ppart(k,m) = 



{m': even} fe+m'/2— l 

E 

k' — k — m'/2 



E 

m f >m 

{m 7 : odd} 

E 



Pbuh(k',m') 



fc+(m'-l)/2 

E 

fc' = fc-(m'-l)/2 



Pbub(fc',m')(13) 



This quantity is less mathematically stringent as it is in- 
dependent of where you assign the position of the bubble. 
Note that this quantity is still somewhat different from the 
projection in Ref. [3] where each bubble is still associated 
to one base-pair position only. In variance with Pbub(fc, 1), 



the bubble participation probability P pa rt(fc, 1) is directly 
related to the simple opening. Hence, P paT t(k, 1) = (9 k ) ^ 

Pbub(M). 



5 The direct numerical integration method 

that appear in Eq. © 



The two quantities (9 k ) ^ and (9 l k 

and (|12[) can be expressed using partition function inte- 
grals: 

,n y Z 9 k ~ Z n 



Z-Z n 
Z-Zn 



(14) 



which arc defined by: 



z = 


1 dy N e 


-PU{y N ) 




z e k = 


f dy N e 


-PU{y N )Q k 




Z < m] = < 


f dy N e 


-/3U(y N ) e [rn] 




Zn = 


f dy N e 




(15) 



In Eq. (|Ti|). we used the fact that (9 k ) 2 = k and 9 k k = 0. 
Note that Z, Zg k , and Zn are infinite, but the differences 
Z — Zn and Zg k — Zn are finite and well defined. 

Now, as all integrals Zx are of the factorizable form 

Zx = fdy N a i x ) (y N ,y N „i)...a x V> {y3,y2)ax\y2,yi) we 
can use following iterative scheme to determine the Zx 
integrals: 



4 2) 0/2) 



J3) 
-X 



0/3) = 



z x \Vn) 



Zx = 



dyi a i x v> {y 2 ,yi) 

dy2 a ( x\y3,y2)z { xHy2) 

dy N -i a^xHyN^N-^Zx^iVN-x) 
dyN z^iyN) 
Mi 



(16) 



The calculation of z x '(y k ) for a discrete set of n gr id 
values y k requires only n^ rid function evaluations when- 



ever z 



(fc-i) 



is known. Hence, a total of N 



id function 

evaluations are required instead of n^ lid which is a huge 
improvement. 

An alternative technique was introduced in Ref. |41j 
where the (y k ,y k -i) kernels are expanded into a proper 
basis-sets. After this expansion, the integrals, like in Eqs. (fT5 
turn into simple matrix multiplications which can be eval- 
uated efficiently. It was found that performance of such a 
method depends strongly of the right choice of basis-set 
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functions. The implementation of this method is, there- 
fore, probably a bit more involved than the direct inte- 
gration scheme of Eq. (fT6|) . Most likely, this method will 
be more efficient to calculate quantities as (y^) that are 
written as averages of continuous functions, than, for in- 
stance, (9k) which involves a discontinuous step-function. 
The latter would require a much larger expansion when 
using continuous basis-set functions. 

(k) 

The factorization of Zx into a x kernels is generally 
not unique. Our choice for for the partition function 
Z is the following 



,0) 



(Vk,Vk- 



e -/3[W( 1 / fc ,i/fc-i)+V fc - 



l(Vk 



')] if k ^ 

i)+V N (y k ] j£ _ 



and for and 



<*n 0/k> 2/fc-i) = a (k) (y k ,yk-i)8k(yk)8k-i(yk-i) (18) 
( a (k) (yk,yk-i) i(k^q,q + l 

ag v '(yk,Uk-i) = < a {k) (yk,yk-i)8k(yk) if k = q 

{ a {k) (yk,yk-i)Ok-i(yk-i) \ik = q+l 

where we use again that Q\ = 9k- Similar expressions can 
be derived for a^l, . 

£)["»] 
Vq 

In order to perform the numerical calculation, wc need 
to define some proper cut-offs where we can stop the inte- 
gration. It is natural to stop the integration whenever the 



weight of a certain configuration g = e 



drops be- 



low a certain threshold value e. It is clear that the energy 
diverges and, hence, g vanishes whenever for a certain k 
the position y k takes a very large negative value or when 
the relative distance \y k — yk-i\ becomes very large. To be 
in safe limits, we calculate the integration cut-offs for the 
pure AT-chain. If we set the integration boundaries such 
that outside this domain we have g < e for this sequence, 
it will also hold for the pure GC or heterogeneous chain. 
The lower limit L of yk results from 



e -0V w {L) <e ^ L < 



1 



In 



/|lne| 



(19) 



To define the maximal distance d between two neighbors 
we assume that pe~ a ( yk+yk - 1 '> is almost zero. This yields 



-f3±Kd 2 



< e =► d > 



'2|lne| 

(3K 



(20) 



If \Vk ~ Vk-l] exceeds the value d at any k, the proba- 
bility distribution g(y N ) must have decreased below the 
threshold e so that we can stop the integration. The upper 
limit R is obtained as follows. Again neglecting the anhar- 
monic p-tcrm, the configuration with the lowest stacking 
energy ^2iW(yk,yk-i) and with a maximal total stretch 
\vn ~ yi\ — S is obtained whenever equidistant positions 
are taken such that \yk — yk-\\ — S/(n — 1). Then, the 
total stacking energy equals (N — l)hK(S/(N — l)) 2 = 



\KS 2 /(N-\) < \KS 2 /N. Therefore, the maximum dis- 
placement of each base, for configurations that belong to 
a double stranded configuration, and with g(y N ) > e, can- 
not exceed R given by 



S with S defined by e 



-P±KS 2 /N _ 



Nd. 



(21) 



This completes the set of cut-off values. In principle, the 
cut-off d is not strictly necessary as L and R are suffi- 
cient to start a numerical approach. However, the cut-off 
d is useful as it decreases the computational expense con- 
siderably. To summarize, via Eq. (|19H21[) we have defined 
(nrce cut-off values which restrict the configuration space 
to L < yk < R and \yk — 2/fe-i| < d for all k. Any configu- 
ration outside this domain must have a Boltzmann weight 
g below e and can, hence, be neglected for the numerical 
integration. 

The integration boundaries increase only slightly upon 
decreasing e. Therefore, we took e = 10~ 40 which is much 
smaller than actually needed for our required accuracy [3] . 
As we take a discrete grid with spacing Ay, the values d, L, 
and R must be adjusted to this grid. That is, we require 
that I d = d/Ay, I L = (C~ L )/Ay and I R = (R- £)/Ay 
should all be integer values. There is another restriction 
for the allowed values of Ir which depends on the spe- 
cific numerical integration method and will be discussed 
in Appendix [Al Coming back to Eqs. (fl"5]) . we actually 
no longer intend to calculate Z, Zg k , and Zjj, which are 
infinite, but Z(R), Zg k (R), and Zjj(R) which have a lin- 
ear dependence as function of R. However, the differences 
Z(R) — Zji(R) Zg k (R) — Zjj(R) converge very rapidly to 
a constant value for R — > oo. 

As the same function evaluations are repeated over and 
over again in this integration scheme (|16p , it is efficient to 
store following values at the start of the algorithm using 
two matrices and defined as: 



M^ /s) = exp(-0W(L + i Ay, L+(i + j)Ay)) 
x exp(-j3[V w/s (L+(i+j)Ay)]) 



(22) 



which are basically the values of two possible a' fe ' (yk, y k -i) 
functions (|17| on the grid. Then, by defining the vector 



(23) 



X f(i) = zf(L + iAy), 
the basic operation in Eq. (fT6l) 



-x 



(yk) 



dyk-i aP(y k ,yk 



y)z x (Vk-i) 



can be recast in following numerical operation 

= Ay^fMt^t'Hi+i) (24) 



where M {k X) is either or M ( % f of Eq. ([22]) depend- 

ing on the type of base-pair k — 1. Of course, like the 
end kernel a^ N '(yN,VN-i) i n Eq- (jl7p . the last matrix in 
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Eq. (|24|) should include the additional factor exp(— /3Vn(i/n)). 
The vector fj depends on the specific Newton-Cotes in- 
tegration method. An analyses of different Newton-Cotes 
schemes is given in Appendix [Al 

The algorithm starts by taking the first vector Xx (*) = 

I and, then, iteratively apply Eq. (|24)) . In order to obtain 

the full vector Xx\i)-> we nee d to perform a loop where i 
runs either from till II , from II till Ir, or from till 

II + Ir depending on whether X allows yf. in Eq. (fTBj) to 
take values over the closed, open, or full domain, respec- 
tively. At each i, we perform an inner loop over j. Also 
Dk-i might take values in the closed, open or full domain 
and its value is assigned by the integer i+j. Hence, similar 
to i we can write that g < i + j < h where g can be either 
or II and h is either /j, or II + Ir. As j ~ \yu — Dk-x\ 
is also restricted by |j| < Id, the inner loop over j runs 
from MAX[-I d , g-i] till MIN[/ rf , h-i]. After the double 
loop over i and j, we increase k by one and repeat the 
procedure. This basically defines the complete numerical 
algorithm, but still leaves open what one should take for 
the vector fj . This is discussed in Appendix |A] in which 
we consider different integration schemes. 



6 bubble probability matrices 

Now we have introduced the mathematical definitions con- 
cerning the bubble statistics in Sec. [4] and derived the 
numerical method to calculate these properties in Sec. [5] 
and Appendix [A] we will apply this method to specific 
sequences. In Ref. |3J, we calculated the bubble probabil- 
ity matrix (|12|) for de adeno-associated viral P5 promoter 
(AAVP5) whose sequence is shown below 



AAVP5: 5'- GTGGCCATTTAGGGTATATATGGCCG 
AGTGAGCGAGCAGGATCTCCATTTTG 
ACCGCG AAATTTG AACG-3' . 

The TSS is shown by an underscore. 

In Fig. [2j we show the same results as Ref. [3] for a 
slightly different threshold value (£ = 1 A instead of 1.5 
A) together with the bubble partition matrix (fT3j) . As we 
are not interested in the boundary effects, we replicated 
the chain at both ends, but only computed the statis- 
tics for the middle chain. This eliminates the effects of 
the free ends, which, otherwise, would yield very large 
opening probabilities at boundary sites [3] . We calculated 
the bubble probability matrix (fT2")l up to bubbles of size 
m = 50 (only up to 30 is shown in Fig. ^ and the bubble 
partition matrix from Eq. (fT3)) . Note that, for reasons of 
visualization, we have applied for each row a normaliza- 
tion approach in these two figures. The normalizing con- 
stants, which are the maxima in each row, are depicted in 
the panels below. Considering these results, one can see 
that the probability for bubbles is approximately exponen- 
tially decreasing as function of the bubble size. The bub- 
bles of size ten have probabilities of the order of ~ 10~ 4 . 
This explains the difficulties of previous MD results [T] as 
the detection of such a large bubble is a true rare event 




base-pair 



^TSS 



10 + 
10 



bubble size 



normalization constant 

20 25 30 



base-pair 



'LTSS 



10 



10 



normalization constant 



bubble size 



20 



25 



30 



Fig. 2. (Color online) Bubble statistics matrices for the 
AAVP5 promoter sequen ce for T = 300K and openings 
threshold £ = 1 A. Top two panels show the bu bble matrix 
Pbub(k, m) of Eq. (|12[1 and the lower two panels show the bub 
ble partition matrices P P art(fc,m) of Eq. fj 13 j) . Each row m of 
the first and thir d panel is normalized by the maximum value 
of the matrix at the given bubble siz e m. The normalization 
constants as function of m is depicted in the panels below. 



on the time-scale accessible by MD. On the other hand, 
the numerical integration method allows to obtain accu- 
rate results for even much larger bubbles. This can be 
important for the study of biological phenomena as, for 
instance, transcription elongation involves DNA openings 
that are larger than ten bases [H]. The method allows to 
obtain accuracies of less than one percent error after only 
a few hours of computation which would otherwise take 
200 years when using MD [3]. 

Although Fig.[5]shows indeed somewhat enhanced open- 
ing in the biologically active regions, it shows that it is 
certainly not true that the TSS has a much higher open- 
ing probability than the other sites as was found in the 
foregoing less accurate MD results pQ. In fact, the -30 re- 
gion shows equal probabilities for opening and and even 
higher probabilities when bubbles of size ~ 10 are con- 
sidered. Inspection of the lowest rows in Fig. [2] basically 
reflects the AT-rich parts of the sequence. The position of 
the preferential opening for the larger bubbles can be rea- 
sonably understood as a merging effect; two small bubbles 
that are close in distance act as the precursor of a larger 
bubble whose center is in the middle of the two smaller 
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ones. The P par t matrix has considerable less structure, but 
shows the same tendency. 

To investigate whether promoter sites are special in 
terms of its bubble probability profile, in Refs. |l|3j a hu- 
man coding gene, known to be free of any protein interac- 
tion sites, was examined. The initial results suggested that 
this sequence had much lower probability for bubbles pQ, 
but the direct integration method showed that the ability 
for bubble formation was certainly comparable in magni- 
tude to the promoter sequences [3]. Here, we study two 
other artificial non-promoter sequences. These are the fol- 
lowing two complementary Fibonacci sequences: 

Fibonacci- 1 : AC A AC AC AAC A AC AC A AC AC A AC A A 
CACAACAACACAACACAACAACACA 
ACACAACAACACAACAACACAACAC 
AACAACACAACAAC 

and 

Fibonacci-2: CACCACACCACCACACCACACCACC 
ACACCACCACACCACACCACCACAC 
CACACCACCACACCACCACACCACA 
CCACCACACCACCA 

which have a total length of 89 and a AT content of 62 
% and 38 % respectively. The choice for Fibonacci has 
been made to analyze the hypothetical enhanced opening 
of biological sequences in comparison with a "random" se- 
quence. However, as a typical random sequence is poorly 
defined, one could come up with any sequence and basi- 
cally "prove" what one wants. Therefore we studied the 
Fibonacci sequences rather than two sequences produced 
by a random number generator. Although those Fibonacci 
sequences are far from random, they are sufficiently disor- 
dered and have the advantage that they doe not contain 
very long weak or strong regions due consecutive repeti- 
tions. In addition, we strictly rule out that the possibility 
we pick by accident a sequence that is biologically active 
as well. 

In fig. 02 we show the results of P part (fc, 1), P par t(fc, 5), P p . 
and Ppart (k, 15) for the Fibonacci sequences together with 
the results for the AAVP5 promoter. The first panel shows 
Ppart (k, 1) which equals the simple opening probability of 
the individual base in the sequences. It shows that the pro- 
moter sequence has some regions that have a considerably 
higher affinity to open up than the Fibonacci sequences. 
This is a result of the presence of longer consecutive AT 
regions in the AAVP5 promoter. The Fibonacci-1 and 
Fibonacci-2 sequence have at most 2 or 1 consecutive weak 
base-pairs in a row. When we examine larger bubbles, we 
see that the base-specific order of the sequences becomes 
less important. The extend of the bubble averages out the 
effect of the precise order of the weak and strong bases. 
Hence, the openings probability profile becomes more and 
more determined by the AT content. This is clearly illus- 
trated by the fact that the promoter sequence's probabil- 
ity profile for bubbles of size 15 remains strictly within 



the two profiles of the Fibonacci sequences at all sites. 
Hence, the chance to find a bubble of 15 is at each loca- 
tion higher in the AT-rich Fibonacci sequence than in the 
AAVP5 promoter, despite the absence of long series with 
consecutive weak AT bases. This also suggest that the 
bubble statistics, at least within the PBD framework, is 
reasonably predictable by some simple rules based on the 
AT content. Indeed, Rapti et al. [45l46j suggest that these 
PBD bubble profiles could be qualitatively reproduced by 
counting the number of AT-bases within a certain window 
that is a bit larger than the bubble size considered. Ac- 
tual DNA in solution seems to be less predictable on basis 
of the AT content alone. The denaturation steps in long 
heterogeneous DNA polymers are very sensitive to the se- 
quence [5] and can qualitatively change when only one 
base-pair is changed. The experimental part of Ref. [TJ also 
suggest that actual DNA bubble statistics retains strong 
non-local effects. A prerequisite for the understanding of 
these result would require a more precise interpretation of 
the measurements by the SI nuclease cleavage technique 
expressed in microscopic terms. The experimental signal 
might well be related to some of the definitions (TT2")) and 
(fl~3l , but probably not straightforwardly. Many questions 
remain such as which range of bubbles can be detected 
by SI nuclease cleavage, where in the bubble takes the 
cleavage place, is bubble life-time important, and many 
more. Much more systematic studies are needed. The re- 
sults of Fig. [3] show that the study of artificial sequences, 
such as the Fibonacci sequences, can reveal different struc- 
tures depending on the size of bubbles that are detected. 
Hence, experimental measurements on artificial periodic 
and quasiperiodic sequences might be very useful to give 
some answers to these intriguing questions. 
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bubble size=15 
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base-pair index base-pair index 

Fig. 3. (Color online) Bubble statistics, 
P part (fc,l),P part (fc,5),P par t(fc,10) and (fc, 15), of the 

AAVP5 promoter and the Fibonacci sequences. 

To summarize this section, our results on the bubble 
statistics using the accurate direct integration method do 
not indicate that biologically active sites have a stronger 
thermally induced enhanced opening than one would ex- 
pect based on the AT content of the sequences. We also 
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examined the effect of higher temperatures upto 350 K 
and different openings thresholds upto £ = 2 A. However, 
the results remained qualitatively the same. Of course, 
this does not necessarily mean that there is not such a re- 
lation as this would first require a validation of the model. 
Therefore, in the next section, we will study the theoret- 
ical results of the ^-denaturation curves that can give a 
more direct comparison with experimental data than the 
bubble statistics (fTU) and fill). 



Fig. 



7 Denaturation curves 

As explained in Sec.[3j the denaturation curves f(T) and 
p(T) of Eq. ([8]) cannot be determined within the PBD 
framework. Luckily, the l(T) denaturation curve can be 
calculated using the PBD model and can be measured 
as well using a recently introduced experimental tec h- T fin 
nique |23|25|24j . For several sequences, Montrichok et al. p3 25 2 
reported some anomalous behavior of I as function of T. 
These experimental results are, hence, an excellent bench- 
mark to test the validity of the PBD model. In this sec- 
tion, we show the calculated l(T) curves for the L60B36, 
L42B18, L33B9, and L48AS given by 




L48AS 



60 80 100 

Temperature (°C) 



60 80 o 100 

Temperature (°C) 



4. / versus temperature for 4 heterogeneous sequences 
36, L42B18, L33B9, and L48AS using four different open- 
reshold definitions £ = 0.5, 1,1. 5, and 2 A. Previous MD 
results of Ares et al. [40] were obtained using £ = 0.5 and the 
different definition of opening Eq. (|10[l . The inset in the right 
upper panel shows the Morse potential Vk of Eq. ([2]) for the 
weak and strong base-pair interaction. 



L60B36: CCGCCAGCGGCGTTATTACATTTAA 
TTCTTAAGTATTATAAGTAATATGGC 
CGCTGCGCC 
L42B18: CCGCCAGCGGCGTTAATACTTAAGT 
ATTATGGCCGCTGCGCC 
L33B9: CCGCCAGCGGCCTTTACTAAAGGCC 

GCTGCGCC 
L48AS: CATAATACTTTATATTTAATTGGCG 
GCGCACGGGACCCGTGCGCCGCC 

In Fig. 21 we show the calculated results for these sequence 
using four values of £: 0.5, f .0, f .5 and 2.0 A. It is impor- 
tant to note that the different opening threshold values 
considered do not change the qualitative behavior of the 
curves. The curves with £ = 2 A intersect the lower thresh- 
old value curves £ = f A and £ = 1.5 A. This might 
seem impossible as each base k, that is counted as open 
because y k > £ = 2 A, must also be open when a lower 
opening threshold value is considered. However, we should 
bear in mind that £ not only determines the definition of 
'open' and 'closed', but also determines the ensemble via 
Eqs. (|5l6p . Considering Eq. ([T4]) . it is certainly true that 
Zg k (R) is strictly decreasing as function of £. However, 
Z 7! {R) is strictly decreasing as well and, hence, the ratio 
[Zg k (R) — Zg k (R)]/[Z{R) — Zg k (R)} can actually increase 
as function of £. 

The experimental results for the L60B36 and L42Bf8 
sequences contained a remarkable change of slope [23125124] 
This effect could indicate that the melting appears in two 
steps in which first an AT rich part of the sequence opens 
up and is then followed by a GC rich region in the se- 
quence. Our results do not show this signature. This is 



in contrast with another computational study by Ares et 
al. [40] which does report some of the experimentally 
found characteristics. However, the change of slope that 
they found was negative for the both sequences L60B36 
and L42Bf 8, while the experimental results showed a very 
sharp positive change of slope in the L42B18 sequence at 
a temperature of 70 °C. Still, the results of Ares et al., 
for the same sequences we studied, seem to resemble more 
closely the experimental results than the ones by us. This 
variance is explained by the following three reasons: (i) 
Ares et al. used the alternative definition of 'open states' 
as expressed by Eq. ([TO]) instead of Eqs. [819] . (ii) They 
applied a selective use of boundary conditions which were 
periodic boundary conditions for the sequences L60B36, 
L42B18, L33B9 and free boundaries, as in this work, for 
the sequence L48AS. (iii) Ref. [40] allowed for complete 
denaturation as it was based on a series of short MC sim- 
ulations without the use of a bias-potential as in Eq. . 
In fact, the work [40] even report on the / and p curves, 
which cannot unambiguously be determined as we pointed 
out in Sec. Hence, the deviation from the experimental 
results must imply that the present PBD model is insuffi- 
cient to reproduce these non-trivial sequence specific order 
effects. 

However, the experimental results themselves raise some 
questions. If we were allowed to neglect the DNA-DNA in- 
teraction, the l(T) curve seems to provide a signature that 
is theoretically independent to the concentration of DNA. 
This is exactly why l(T) can be determined within the 
PBD framework. Still, it would be interesting to verify ex- 
perimentally whether the l(T) curve is indeed insensitive 
to this concentration. Moreover, some of the experimental 
results are a bit puzzling. The experimental f(T) andp(T) 
denaturation curves of the L33B9 sequence, for instance, 
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coincide at 75 °C while still /(75 °C) = p(75 °C) < 1 
As I = (/ -p)/(l-p), this would imply that l(T) = for 
T > 75 °C. This finding seems to be unphysical and this 
is probably also the reason that Montrichok et al. have 
depicted the l(T) curve until T = 75 °C in Ref. [SJ. This 
indicates that one has to be careful when translating the 
UV absorbance experiments in microscopic terms using 
Eqs. I[8l9p . The theoretical development in this field would 
benefit significantly if more experimental data based on 
the quenching technique were available. 

8 Conclusions 

The statistics of thermally induced DNA bubbles has be- 
come an important subject of theoretical and experimen- 
tal studies. Besides the fact that it is a interesting subject 
from a purely statistical physics point of view, the rela- 
tion between thermally induced bubbles and biologically 
active sites has been subject of recent debate. Mesoscopic 
models, like the PBD model, are a prerequisite in these 
studies as the experimental data can usually only give 
indirect information. However, even if a good theoretical 
model is developed, it is not easy to obtain accurate re- 
sults as large bubbles occur only seldom in a microscopic 
system. In previous publications, the inaccuracy inherent 
to MD have lead to premature conclusions such that the 
TSS has a much stronger affinity to form bubbles than 
any other arbitrary site 1 112 j - In a recent publication by 
us [3], we showed, using a new statistical method that is 
orders of magnitude faster than MD, that this statement 
had to be reconsidered. Although the biologically active 
sites have some enhanced opening due to their relative 
high content of weak AT base-pairs, the bubble probabil- 
ity profile given by the PBD model was certainly not suf- 
ficient to make accurate predictions on transcription sites 
or to discriminate between biologically active and inactive 
sequences. Hence, this implies that either the biologically 
active sites cannot be assigned by the information of ther- 
mally induced bubbles alone or the actual PBD model 
is insufficient to describe all the sequence specific effects 
correctly. The SI nuclease experiments seem to suggest 
a correlation between bubbles in vitro and transcription 
sites. It is, however, not exactly clear how the SI nuclease 
measurements should be translated in microscopic terms 
that can be calculated by computer experiments. 

In this article, we have revisited the direct numerical 
integration technique that was introduced in [3] . We have 
given a detailed explanation of the algorithm and inves- 
tigated the performance of different integration schemes. 
Although the higher order Newton-Cotes schemes are bet- 
ter for very high precision results with many digits, the 
simple Simpson ^-rule or Boole's rule are more efficient if 
only an accuracy of a few percent is required. The opti- 
mal result is obtained when the Simpson's or Boole's rule 
is combined with the simple rectangular rule. The latter 
is used when the function vanishes at the two integration 
boundaries. Moreover, we have given a thorough discus- 
sion on how to treat finite chains using the PBD model 
by introducing the double stranded DNA ensemble. This 



eliminates all the problems due to the unnormalizability of 
equilibrium distribution in the full space and gives results 
that can be compared by experiments performed below 
the melting temperature. 

Within this ensemble, we have defined two types of 
bubble probabilities. Pb u b{k,m) is the probability that 
a bubble of exactly size m is centered at base-pair k. 
-Ppart(fc, m) is the participation probability that site k is in- 
side a bubble of at least m bases long. Our analyses on the 
AAVP5 promoter sequence and two artificial Fibonacci se- 
quences confirm what we found before [3] . No theoretical 
evidence was found that bubbles appear more frequently 
at transcription sites than at other sites that have a similar 
AT content. When larger bubbles are considered, the effect 
of sequence specific order becomes even less important. A 
recent theoretical study of Rapti et al. [55] confirms this 
and reveals that the PBD bubble statistics profile can be 
qualitatively reproduced by counting the number of AT 
within a certain window that is larger than the bubble 
size. The questions remains whether this is also true for 
actual DNA. The SI nuclease experiments suggest that 
the behavior of real DNA is more complicated than that. 

To study the validity of the PBD model, we applied our 
method to calculate the so-called /-denaturation curves 
that allow to make a more direct comparison to exper- 
imental results. As argued, the standard /-denaturation 
curve cannot be obtained without additional parameters 
due the problem of normalizability for finite DNA chains. 
Luckily, the /-denaturation curves can be measured as well 
via a recently introduced quenching technique [23 2 4125) . 
Our theoretical calculations did not reproduce the experi- 
mentally found anomalies of the l(T) denaturation curve. 
This points out a significant weakness of the present PBD 
model. This also implies that the bubble hypothesis postu- 
lated by Choi et al. PQ could still be supported by theoreti- 
cal evidence whenever an 'ideal' DNA model is considered. 
The indirect evidence of the SI nuclease experiments is yet 
insufficient to make this statement absolute as its meaning 
in terms of microscopic terms is not yet completely un- 
derstood. It is also difficult to believe that the statement 
holds for all TSS as some transcription sites are known 
that consists of at least three consecutive strong base in 
a row [17] . More systematic experimental and theoretical 
studies are required. 

Theoretical improvement can probably be achieved when 
a more complicated stacking interaction is taken into ac- 
count. We found that some of the anomalies found by 
Montrichok et al. 23 2 1 25 could be reproduced using a 
different base-pair specific stacking potential W(yk,y-i) 
([2|) [48]. However, more complicated potentials might be 
needed. It is important to note that the direct integration 
method is not restricted to the PBD model only. It can 
be used whenever the proper factorization (j 1 6(1 can be 
applied. Our preliminary results indicate that the PBD 
model could be improved considerably while still main- 
taining the one-dimensional character of the model. This 
implies that the direct integration method could still be 
applied for this new class of models and will, hence, proba- 
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bly remain an important method for the future theoretical 
developments in this field. 
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A Newton-Cotes integration schemes 

In SecEl we have given the derivation of the direct inte- 
gration method upto the numerical implementation which 
basically comprises an iterative operation of Eq. (|24|) . The 
vector fj depends on choice of Newton-Cotes integration 
scheme. In general, the Newton-Cotes numerical integra- 
tion approximates any integral over an finite range J g(x) dx 
by Ay Y^=o /* (° + i^y) with n — (b — a) /Ay. From the 
various Newton-Cotes schemes, we will discuss the simple 
rectangular rule, Simpson —rule, Boole's rule, and the 11- 
point Newton-Cotes formula. The corresponding fi vec- 
tors are listed below. 



The right choice of integration scheme can significantly 
improve the precision of the method. One cannot say in 
advance that the highest order scheme is always prefer- 
able. This can depend on the shape of the function g, the 
applied integration boundaries, and the required precision. 
In order to study the accuracy of the integration methods, 
we applied the different schemes (|25M29|) on the standard 
integral J a °° e~ x dx where we take a = — oo, and 1. We 
take a numerical cut-off such that \x\ < 10 on the integra- 
tion domain. In general, the higher order Newton-Cotes 
numerical integration schemes require that the total num- 
ber of integration intervals n must be multiples of a certain 
value. These are 2, 4, and 10 for, respectively, Simpson's 
rule, Boole's rule and 11-point Newton-Cotes. However, 
as the function vanishes at the right boundary [x = 10 
in our numerical approach) we can take the semi-infinite 
analogue where we start with /o at the point x = a (or 
x = —10 if a = — oo) and then simply continue with 
/i,/2,... until the point x = 10 without requiring the 
correct ending /„ = / . 



Rectangular rule: 

Trapezoidal rule: 

fi- 

Simpson's | rule: 

h = I X 
Boole's rule 125] : 



fi = — x 
H 45 



fi = 1 for all i, 



i for i = 0, n 
1 fori = 1,2,3,. 
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and the 11-point Newton-Cotes rule [50] : 
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Fig. 5. (color online) The absolute error for the integration 
of exp(— x 2 ) Ax as function of the number of intervals n in 
the numerical approach. The rectangular rule (r), trapezoidal 
rule (t), Simpson's |-rule (S), Boole's rule (B), and the 11 
point Newton-Cotes scheme (N) are compared. Three cases 
are considered: a = — oo (top), a — (middle), and a = 1 
(bottom). The horizontal plateau in the first two panels is a 
result from the cut-off at ±10. 



In fig. [5] we have plotted the integration errors as func- 
tion of n obtained by the five Newton-Cotes methods and 
the three values of a. We see that the highest order scheme 
is not always the best choice. In fact, for the integration 
over the full range (a = — oo), the simple rectangular rule 
is identical to the trapezoidal rule, but far superior to the 
other methods (|27H29[) . Naturally, as the function vanishes 
at both ends, the result would not change much upon shift- 
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ing the initial point to — 10 + Ax. Averaging over several 
shifts using Eqs. (|27H29|) results in a weighted summation 
that approaches the simple rectangular rule (|25|) . The op- 
timum performance of the rectangular rule on the infinite 
domain is, hence, not surprising. 

The trapezoidal rule gives the optimal result for the 
case a = 0. Also this is not too surprising as the function 
is symmetric and the trapezoidal rule is exactly half the 
result of the rectangular rule over the full domain. For 
a = 1 we find, as expected, that the 11-point Newton- 
Cotes method gives the best result. However, only at large 
n the difference becomes apparent. 

We also analyzed the performance of the different Newton- 
Cotes schemes for the bubble statistics in the PBD model. 
As a benchmark, we compared the calculated values of 
I ([9]) at temperature T = 300 K and threshold opening 
£ = 1 A for a 10 base-pair long homogeneous AT chain 
with free boundaries. Considering previous results, we al- 
ways applied the rectangular rule for the integrals in (|16[) 
when the integrated function vanishes at both integra- 
tion or cut-off boundaries in yk-i- These are either at 
Uk-i = L or at yh—i = yk ± d. When the integrated func- 
tion only vanishes at one end, we applied the semi-infinite 
variation of one of the Newton-Cotes formulas [251129] , 
It is important to notice that, as we mentioned before, 
the distribution function is not vanishing at yu-i = R. 
Hence, the Newton-Cotes rule needs to be applied at this 
boundary for an optimal accuracy; i. e. we start with /o 
at this boundary and continue in the negative direction 
R — Ay, R — 2 Ay, . . . for the numerical integration. 

The integrals with two non-vanishing boundaries ap- 
pear only for the last integrations Zx = J dyN^x^ in 
Eq. (|16|) and when j/jv must be integrated over the open 
domain only. Then, both at the left boundary yjy = £ as 
at the right boundary y^ = R, the function is not neces- 
sarily decayed below e. This also implies that R — £ is the 
only interval that must be a special multiple of Ay. This 
must be an multiple of 2 for Simpson's rule, 4 for Boole's 
rule and 10 for 11-point Newton-Cotes rule and this gives 
the restriction to the possible integer values that Ir can 
take. 

After these technical details are taken into account, 
the Newton-Cotes formulas [251129] can be applied to the 
benchmark system and allow to compare the different in- 
tegration methods. The results are depicted in table [TJ 
These show that it is certainly beneficial to go beyond the 
simple rectangular or trapezoidal rule. Although, higher 
order schemes like the 11-point Newton-Cotes are presum- 
ably better at very small values of Ay and very high pre- 
cision, at larger values of Ay the Simpson's and Boole's 
method give better results. The highest precision results 
with Ay = 0.0125 A show an accuracy of 8 digests for 
both Simpson, Boole and 11-point Newton-Cotes, while 
the computational expense is less than a minute. Such a 
performance is far beyond any MD or MC method even if 
enhanced sampling is applied [51j . 

For our purposes, an accuracy a few percent is enough. 
Therefore, considering the results of Fig. [5] and Table [TJ 
we have chosen to use Simpson's rule with a grid spacing 



of Ay = 0.1. In the results of Sec.[6]and[7j we have always 
used these parameters. 
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Ay 


0.2 


0.1 


0.05 


0.025 


0.0125 


r 


1.7059 


1.6112 


1.57187104 


1.5520500628 


1.5422425176 


t 


1.5506 


1.5339 


1.53313317 


1.5326609537 


1.5325428897 


s 


1.5015 


1.5307 


1.53253332 


1.5325035590 


1.5325035346 


B 


1.4887 


1.5333 


1.53258630 


1.5325015790 


1.5325035330 


N 


1.1940 


1.5156 


1.53183258 


1.5325009318 


1.5325035341 



Table 1. Analysis of the accuracy of the Newton Cotes integration scheme. ?(10 _1 ) for a 10 base-pair homogeneous AT chain 
is shown for different values of Ay. 5 integration schemes are compared: rectangular rule (r), trapezoidal rule (t), Simpson's 
i-rule (S), Boole's rule (B), and 11-point Newton-Cotes formula (N). 



